-
-
Notifications
You must be signed in to change notification settings - Fork 290
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Special case str
dtype in array creation
#2323
Special case str
dtype in array creation
#2323
Conversation
One thing to figure out: what do we want for the
Right now it's |
src/zarr/core/common.py
Outdated
@@ -162,3 +164,10 @@ def parse_order(data: Any) -> Literal["C", "F"]: | |||
if data in ("C", "F"): | |||
return cast(Literal["C", "F"], data) | |||
raise ValueError(f"Expected one of ('C', 'F'), got {data} instead.") | |||
|
|||
|
|||
def parse_dtype(dtype: Any) -> np.dtype[Any]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This type could be greatly improved. Really, we want the same signature as np.dtype
, but with an overload for str -> dtype(object)
. But maybe as a follow up.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for working on this!
However, I'd love if if we could scope this just to V2 and not touch V3.
src/zarr/core/metadata/v3.py
Outdated
@@ -504,6 +504,7 @@ class DataType(Enum): | |||
complex128 = "complex128" | |||
string = "string" | |||
bytes = "bytes" | |||
object = "object" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This feels like a dealbreaker to me. Python objects are not an allowable V3 datatype. Eliminating Python objects was one of the main goals of the V3 evolution!
In #2036 I did a lot of work towards more nuanced string support in V3, and that is now mostly working with Xarray.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Python objects are not an allowable V3 datatype.
seconding this -- we definitely don't want an object dtype for v3 data.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What behavior do people want, in terms of the in-memory representation and the on-disk metadata? Is this correct?
- No change for
zarr_format=2
. We use object dtype in-memory and|O
orobject
in the metadata document. - We use the new variable-width string handling for
zarr_format=3
.StringDtype()
(maybe with a fallback to object if NumPy<2?) in memory andstring
in the metadata document
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes that is what I want. I believe 2 is already implemented and tested for V3 here:
zarr-python/tests/v3/test_codecs/test_vlen.py
Lines 49 to 52 in aa46b45
a[:, :] = data | |
assert np.array_equal(data, a[:, :]) | |
assert a.metadata.data_type == DataType.string | |
assert a.dtype == expected_zarr_string_dtype |
I didn't touch V2 however.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
f85bb19
to
7e76e9e
Compare
This is still not quite working with Xarray the way I was hoping. Trying to track down why. This is with this branch merged to import xarray as xr
import zarr
import numpy as np
ds = xr.Dataset({"strings": ("b", np.array(["ab", "cdef", "g"], dtype=object))})
store = zarr.storage.MemoryStore({}, mode="w")
ds.to_zarr(store, zarr_version=2)
zarr.open_group(store, zarr_version=2, mode="r")["strings"].dtype
# -> dtype('<U') With Zarr 2.18.3 ds = xr.Dataset({"strings": ("b", np.array(["ab", "cdef", "g"], dtype=object))})
store = zarr.storage.MemoryStore({})
ds.to_zarr(store)
zarr.open_group(store, mode="r")["strings"].dtype
# -> dtype('O') |
…pecial-case' into tom/fix/dtype-str-special-case
@rabernat d8f24a8 fixes the issue you raised in #2323 (comment) when using this through xarray. We also need to special case With an updated
Ah, unfortunately reading the data doesn't quite work yet: In [9]: zarr.open_group(store, mode="r")["strings"][:] eventually raises with File ~/gh/zarr-developers/zarr-python/src/zarr/codecs/_v2.py:40, in V2Compressor._decode_single(self, chunk_bytes, chunk_spec)
38 # ensure correct dtype
39 if str(chunk_numpy_array.dtype) != chunk_spec.dtype:
---> 40 chunk_numpy_array = chunk_numpy_array.view(chunk_spec.dtype)
42 return get_ndbuffer_class().from_numpy_array(chunk_numpy_array)
File ~/gh/pydata/xarray/.direnv/python-3.12/lib/python3.12/site-packages/numpy/_core/_internal.py:564, in _view_is_safe(oldtype, newtype)
561 return
563 if newtype.hasobject or oldtype.hasobject:
--> 564 raise TypeError("Cannot change data-type for array of references.")
565 return
TypeError: Cannot change data-type for array of references. I'll look into that too. |
Any codec experts want to chime in on whether 509a5c1 is appropriate? If I just remove that |
If you leave the |
Yeah, the traceback above has the relevant bit: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this looks good! As a non-string-dtype user I'm pretty surprised by the complexity involved in getting strings working in v3. Do you think we should have a section of the docs that specifically covers strings in zarr 2 / 3? That would be a separate effort from this PR of course.
Happy to write up those docs. One question on the intended behavior for v3. AFAICT, we don't support fixed-width, unicode strings. That uses In [1]: import zarr
In [2]: arr = zarr.create(shape=(3,), dtype="U3")
In [3]: arr[:] = ['a', 'bb', 'ccc']
In [4]: arr[:]
Out[4]: array(['a', 'bb', 'ccc'], dtype=StringDType()) There are some advantages to fixed-width strings when you know you have fixed-width data. Do we want to try to support that? |
Yes, I think it would be nice to support fixed-width data. In principle this can be done with just raw byte / int arrays, right? Logical vs. physical dtype etc. |
Can we move the fixed-width string conversation to a separate issue so we can merge this? |
Yep, definitely. |
Special cases
str
, interpreting it asnp.dtype("object")
, to match zarr-python 2.x's behavior.Closes #2315
TODO: